How to join tabular(CSV) data to a shapefile


GIS in Python: Table Join

Joining Tabular data to a shapefile allows you to add fields from your data table to a shapefile’s attribute table based on matching values found in the key columns, so that you may generate a choropleth map of your variable of interest. For example, you would like to locate which area in New York state has the largest concentration of African Americans living below the poverty level. However, you only have a New York state census tracts shapefile without the attribute field for the Black population below the poverty level and a tabular file including the poverty population data at census tract level downloaded from the U.S. Census. Let’s conduct this analysis using Python and the GeoPandas package. You can use the read_file() method from the geopandas package to read shapefile and use the read_csv() method from the pandas package to read the tabular dataset that you would like to join with the shapefile into Python. Then use the merge() method to join the tabular data with the shapefile based on common columns, and use the matplotlib to show a choropleth map of the interested variable. Here is the data for this exercise.


Part A: Install and Launch Jupyter Notebook via Anaconda

If you already have Anaconda downloaded and installed, you can skip Part A and directly start the analysis in Part B. Make sure you also have packages pandas, geopandas, matplotlib, descartes, and mapclassify installed in the environment where you would like to conduct this analysis.

1) First, download Anaconda. Anaconda is a free and open-source distribution of Python. You can use Anaconda to install IDEs (integrated development environments where you can write and run code) and packages like Pandas and Geopandas. Go to the link to download Anaconda, https://www.anaconda.com/products/individual, and then open the .exe file that was downloaded and follow the instructions in the installation wizard prompt.


2) Once installation is complete, open Anaconda Navigator and create a new environment for your project. A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. Conda has a default environment called 'base' that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base environment, and, instead, create an isolated environment to manage packages and dependencies in a new project.

Click on the Environments selection in the left sidebar menu and then click on the 'Create' at the bottom. This will open a dialog box prompting you to create a name for the new environment. You can give any name for your new environment. Here, we use 'GIS_in_Python' as the environment name. Then click the 'Create' button within the dialog box to finish the creation.


3) Once you have your project environment set up, click on the arrow to the right of your new environment, 'GIS_in_Python' in this example, and select Open Terminal. This will give you access to the command line interface on your computer in a window.


4) Install the packages/libraries necessary for the analysis by entering the following commands in the opened terminal, one line at a time:
conda install pandas
conda install geopandas
conda install matplotlib
conda install descartes
conda install mapclassify


5) Once you have those libraries all installed, select the new environment, 'GIS_in_Python' in this example, in the 'Applications on' dropdown menu, and then click "install" and "launch" under Jupyter Notebook. Jupyter Notebook will open in your web browser (it does not require the internet to work).


6) In Jupyter Notebook, navigate to the folder where you saved the code file you plan to use and open the .ipynb file (the extension for Jupyter Notebook files written in Python) to run it in the Notebook. If you would like to create a new .ipynb file, browse to the folder in which you would like to save your Notebook, then click the "New" dropdown button on the top-right and select "Python 3". Your new Notebook will open in a new tab in your browser. If you want to create a new directory using the Jupyter Notebook dashboard, click the "New" dropdown button and then select "Folder". To add files from your local machine, click the "Upload" button on the top-right to open a file chooser window and then choose the file you wish to upload.


Part B: Read Data File and Perform a Table Join and other Spatial Analysis

1) Import necessary packages/libraries.


2) Use the gpd.read_file() function from the geopandas package to read the shapefile. Optionally, you can use the head() method to return the first 5 rows of the GeoDataFrame.


You may also use matplotlib for plotting to generate an overview of your GeoDataFrame.


3) Use the read_csv() method from the pandas package to read the tabular dataset that you would like to join with the shapefile. Make sure the tabular dataset has at least one column in common with the shapefile so that you can use it as a key column to join the two. For our tutorial the key column is 'GEOID'. Optionally, you can use the head() method to return the first 5 rows of the DataFrame.


4) Before joining, use the dtypes attribute to check if the data types of the common columns in the two data frames are the same and use the astype() method to convert data types to be the same if they are different.
In this example, 'GEOID' is the column we want to examine in the two datasets (NYS_tracts['GEOID'] and poverty['GEOID']) for their data types (.dtypes). The data type of 'GEOID' in the 'NYS_tracts' GeoDataFrame is 'object' which represents strings while the data type of GEOID in the 'poverty' DataFrame is 'int64', which means that the two columns aren’t compatible, so we must convert the data types. We convert 'GEOID' in the 'NYS_tracts' GeoDataFrame (NYS_tracts['GEOID']) from strings to integer (.astype('int64')).


5) Once you’ve had shapefile and tabular data both ready in the session, you can use the merge() method to join the two datasets based on the field that is common to both. In this example, the GeoDataFrame 'NYS_tracts' is joined with the pandas DataFrame 'poverty' based on the common variable 'GEOID'.

Optionally, you can use the head() method to return the first 5 rows of the 'join' GeoDataFrame.


6) You can now use matplotlib to generate a choropleth map of the variable of your interest. For this example, based on the map of 'NYS_tracts' we generated in the Step 2, further specify that:


You can also sort the table by values in the column of interest using sort_values() method.

In conclusion, we can see that in the choropleth map accompanied with the sorted table above, Kings County concentrates many census tracts that have a large number of Black and African American living below poverty level in 2019.